Special Linguistic Phenomena in the Bulgarian HPSG-based Treebank (BulTreeBank)
Abstract
Currently the BulTreeBank comprises 214 000 tokens in a little more than 15 000 sentences. Each token is annotated with morphosyntactic information. Additionally, the named entities are annotated with ontological classes such as person, organization, location, and other. Based on HPSG theory, the annotation scheme defines a number of phrase types which reflect both the constituent structure and the head-dependent relation. Thus we have phrase labels that make the dependent type explicit, such as VPC (verbal head complement phrase), VPS (verbal head subject phrase), VPA (verbal head adjunct phrase), NPA (nominal head adjunct phrase), etc. Beyond the constituent structures and the head-dependent relations, the treebank also represents phenomena like coordination, ellipsis, pro-dropness, word order, secondary predication, and control; see (Simov and Osenova 2003). We will focus on some of them in this demo presentation. The treebank is encoded in XML.
Similar resources
Practical Annotation Scheme for an HPSG Treebank of Bulgarian
The paper presents an HPSG-based annotation scheme for constructing a Bulgarian treebank: BulTreeBank. It differs from other grammar-based annotation schemes in having a hybrid status with respect to the partial parsing component and the full parsing module. As the parsing complexity is handled preferably by the pre-processing step, the task of the HPSG module is maximally facilitated and simpl...
A Data-Driven Dependency Parser for Bulgarian
One of the main motivations for building treebanks is that they facilitate the development of syntactic parsers, by providing realistic data for evaluation as well as inductive learning. In this paper we present what we believe to be the first robust data-driven parser for Bulgarian, trained and evaluated on data from BulTreeBank (Simov et al., 2002). The parser uses dependency-based representa...
Constituency Parsing of Bulgarian: Word- vs Class-based Parsing
In this paper, we report the obtained results of two constituency parsers trained with BulTreeBank, an HPSG-based treebank for Bulgarian. To reduce the data sparsity problem, we propose using the Brown word clustering to do an off-line clustering and map the words in the treebank to create a class-based treebank. The observations show that when the classes outnumber the POS tags, the results are...
Language Resources and Tools for the Creation of a Bulgarian Treebank
This paper describes a framework for the creation of an HPSG-based treebank of Bulgarian. The architecture consists of several types of language resources and tools, such as gazetteers, a morphological dictionary, a valence dictionary, a semantic dictionary, named entities recognition grammars, chunk grammars for NPs and VPs, a general HPSG grammar. The paper describes each of them, including t...